This is the start of my course diary (check the link). Updates coming soon.
In this course I will learn about:
Let's read the data from a local folder and see how many observations and variables we have to work with.
learning2014 <- read.csv("data/learning2014.csv",row.names = "X")
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
We have 166 observations and 7 variables from a survey collected on a statistics course. There are about twice as many females as males, and the median age is 22. Attitude, deep, stra, and surf are measured on a Likert scale. Attitude combines motivation to learn statistics with confidence in one's mathematical skills. Deep combines questions on how people evaluate information and their willingness to understand and apply it. Stra measures strategic behaviour: how an individual works and plans. Surf, on the other hand, measures whether people are uninterested in their studies or concentrate on course requirements rather than on fully understanding the subject, i.e. "surface learning". Points is the student's best overall exam score, i.e. "performance". The code blocks above show the key summary statistics for each variable.
Let’s show a graphical overview of the data.
library(ggplot2)
library(GGally)
theme_set(theme_light(base_size=9))
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
The graph above summarizes the distributions of all the variables, coloured by gender. Men seem to report a better attitude, are less inclined towards surface learning, and perform slightly better. Most females, on the other hand, describe themselves as deep learners and display more strategic learning patterns. Another noticeable point is that the men are on average older than their female counterparts. Note also the correlations between the variables: most notable are the high positive correlation between exam points and attitude, and the negative correlation between points and surface learning.
In this section we try to find a regression formula that summarizes the students' performance (points) well. We start by trying out several models and finally arrive at one that we believe is a good approximation, keeping the variables with the biggest effect on performance based on their statistical significance.
model1 <- lm(points ~ deep + surf + gender, data = learning2014)
summary(model1)
##
## Call:
## lm(formula = points ~ deep + surf + gender, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.8988 -3.6363 0.3912 4.2481 10.7905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.8145 4.7353 6.296 2.75e-09 ***
## deep -0.6964 0.8700 -0.800 0.4246
## surf -1.7461 0.9159 -1.906 0.0584 .
## genderM 0.9827 0.9680 1.015 0.3115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.857 on 162 degrees of freedom
## Multiple R-squared: 0.03062, Adjusted R-squared: 0.01267
## F-statistic: 1.706 on 3 and 162 DF, p-value: 0.1679
# remove insignificant variables and try a different model
model2 <- lm(points ~ surf + age + stra, data = learning2014 )
summary(model2)
##
## Call:
## lm(formula = points ~ surf + age + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.8229 -3.5885 0.4698 4.1271 9.7896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.28166 3.68767 7.127 3.2e-11 ***
## surf -1.56432 0.87103 -1.796 0.0744 .
## age -0.09642 0.05885 -1.638 0.1033
## stra 1.04286 0.59393 1.756 0.0810 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.792 on 162 degrees of freedom
## Multiple R-squared: 0.05205, Adjusted R-squared: 0.03449
## F-statistic: 2.965 on 3 and 162 DF, p-value: 0.03377
# remove insignificant variables and try a different model
model3 <- lm(points ~ surf + stra + attitude, data = learning2014 )
summary(model3)
##
## Call:
## lm(formula = points ~ surf + stra + attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## surf -0.5861 0.8014 -0.731 0.46563
## stra 0.8531 0.5416 1.575 0.11716
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# remove insignificant variables and try a different model
model4 <- lm(points ~ attitude + deep + gender, data = learning2014 )
summary(model4)
##
## Call:
## lm(formula = points ~ attitude + deep + gender, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.0364 -3.2315 0.3561 3.9436 11.0859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6240 3.1799 4.284 3.13e-05 ***
## attitude 3.6657 0.5984 6.125 6.61e-09 ***
## deep -0.6172 0.7546 -0.818 0.415
## genderM -0.4633 0.9170 -0.505 0.614
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.337 on 162 degrees of freedom
## Multiple R-squared: 0.1953, Adjusted R-squared: 0.1804
## F-statistic: 13.1 on 3 and 162 DF, p-value: 1.054e-07
# remove insignificant variables and try a different model
my_model <- lm(points ~ age + attitude + stra, data= learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
The model we arrive at is
points ~ age + attitude + stra
We can interpret our model as follows: a one-point higher attitude (on the Likert scale) predicts, on average, roughly 3.5 points more in the exam, with a standard error of about 0.56. Age and strategic behaviour are statistically significant only at the 10% level, so we refrain from interpreting them: it might be just chance that these variables appear to have a non-zero effect on test scores. The adjusted R-squared, a measure of goodness of fit, is 0.2037 and the F-statistic is statistically significant. This means that the model has some explanatory power and is statistically significant in explaining test-score performance. Note that the ordinary R-squared necessarily increases as one adds variables to the model, while the adjusted one does not.
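To make the coefficient interpretation concrete, here is a minimal sketch on simulated data (not the course data) showing that the lm() slope recovers the average change in the outcome per one-unit change in the predictor. The variable names and true values below are invented for illustration.

```r
# Simulate exam points that depend linearly on a hypothetical Likert-type
# attitude score, then check that lm() recovers the slope.
set.seed(1)
n <- 200
attitude <- runif(n, 1, 5)                    # invented predictor on a 1-5 scale
points <- 11 + 3.5 * attitude + rnorm(n, sd = 5)  # true slope 3.5
fit <- lm(points ~ attitude)
coef(fit)["attitude"]  # estimate lands close to the true slope 3.5
```

A one-unit higher attitude then predicts roughly `coef(fit)["attitude"]` more points on average, which is exactly how the attitude coefficient above is read.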
library(ggplot2)
par(mfrow = c(2,2))
plot(my_model, which = c(1, 2, 5))  # residuals vs fitted, normal Q-Q, residuals vs leverage
plot(density(my_model$residuals), main = "Distribution of residuals")
In the residuals-versus-fitted-values plot there is no apparent pattern, which suggests that the size of the errors does not depend greatly on the explanatory variables used. The Q-Q plot shows a reasonable fit to the line, meaning that the residuals are roughly normally distributed; this is also demonstrated by the density plot of the residuals. The leverage plot shows a couple of observations that differ from the rest and might therefore have a large impact on the results, but the leverage values are small, so we can still be confident in our model.
We can be fairly confident in our conclusion that attitude has the biggest effect on exam performance. The diagnostics above give no apparent reason to suspect that the residuals are correlated with the explanatory variables, that they deviate substantially from normality, or that our results are driven by high-leverage outliers.
Let's first read the data from a local folder and print out the column names. This file contains data on student alcohol consumption, descriptive variables and, most importantly, the number of absences and the final grade (G3, from 0 to 20). We will analyse the relationship between alcohol consumption (the binary variable high_use) and other variables (the G3 test grade, mother's education, gender, and absences). Hypotheses: i) in previous studies the mother's educational background has usually had good explanatory power for the performance and behaviour of her children; ii) males tend to drink more; iii) absences capture "dropout" behaviour in the data; iv) low test performance likely increases the tendency to drink (and vice versa), because students are, for lack of a better word, demoralized or simply sad; v) freetime and goout measure reported boredom or neglect of coursework as well as the amount of socializing.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summarise(group_by(alc, high_use), testin_k_arvo = mean(G3), testin_hajonta = sd(G3))
## # A tibble: 2 × 3
## high_use testin_k_arvo testin_hajonta
## <lgl> <dbl> <dbl>
## 1 FALSE 11.73507 3.385467
## 2 TRUE 10.80702 3.042115
tarkasteltavat_muuttujat <- c("high_use","G3","Medu","sex","absences","freetime","goout")
data <- select(alc, one_of(tarkasteltavat_muuttujat))
glimpse(data)
## Observations: 382
## Variables: 7
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,...
library(ggplot2)
library(tidyr)
gather(data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped
Next we check the frequencies of high usage across the sexes together with the means of the variables of interest: grades (G3), absences, mother's education, freetime, and going out. We also draw boxplots to check the distributions within these subgroups.
library(dplyr)
data %>% group_by(sex,high_use) %>% summarise(count = n(),mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 11.39744
## 2 F TRUE 42 11.71429
## 3 M FALSE 112 12.20536
## 4 M TRUE 72 10.27778
ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab("grade") + ggtitle("High Alcohol Usage and Test Performance")
data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_absences
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 4.224359
## 2 F TRUE 42 6.785714
## 3 M FALSE 112 2.982143
## 4 M TRUE 72 6.125000
ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + ylab("absences") + ggtitle("High Alcohol Usage and absences")
data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_mothersEdu = mean(Medu))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_mothersEdu
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 2.673077
## 2 F TRUE 42 2.785714
## 3 M FALSE 112 2.973214
## 4 M TRUE 72 2.847222
ggplot(alc, aes(x = high_use, y = Medu, col=sex)) + geom_boxplot() + ylab("Mothers Education") + ggtitle("High Alcohol Usage and Mothers Education")
data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_freetime = mean(freetime))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_freetime
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 2.929487
## 2 F TRUE 42 3.357143
## 3 M FALSE 112 3.392857
## 4 M TRUE 72 3.500000
ggplot(alc, aes(x = high_use, y = freetime, col=sex)) + geom_boxplot() + ylab("Freetime") + ggtitle("High Alcohol Usage and freetime")
data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## Source: local data frame [4 x 4]
## Groups: sex [?]
##
## sex high_use count mean_goout
## <fctr> <lgl> <int> <dbl>
## 1 F FALSE 156 2.955128
## 2 F TRUE 42 3.357143
## 3 M FALSE 112 2.714286
## 4 M TRUE 72 3.930556
ggplot(alc, aes(x = high_use, y = goout, col=sex)) + geom_boxplot() + ylab("Spend time with friends") + ggtitle("High Alcohol Usage and Going out")
It appears that test performance is lower for high users of alcohol, especially among men. A similar effect can be found for absences: high users of alcohol are more likely to miss classes, again especially among men. We did not find a clear difference based on mother's education (our initial hypothesis seems to have been wrong), so we turn our attention to freetime. High users seem to have more freetime, which might be due to neglect of work or plain boredom. Similarly, individuals who are more social or have less preference for studying tend to drink more.
Next we fit a logistic model to explain high usage of alcohol. Our model is the following
high_use ~ freetime + absences + sex + goout
# find the model with glm()
m <- glm(high_use ~ freetime + absences + sex + goout, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + absences + sex + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7938 -0.8060 -0.5306 0.8297 2.4887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.33104 0.57522 -7.529 5.10e-14 ***
## freetime 0.07254 0.13705 0.529 0.596610
## absences 0.08487 0.02240 3.789 0.000151 ***
## sexM 0.93806 0.25762 3.641 0.000271 ***
## goout 0.71094 0.12444 5.713 1.11e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.47 on 377 degrees of freedom
## AIC: 397.47
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) freetime absences sexM goout
## -4.33104146 0.07253827 0.08487446 0.93806336 0.71094375
# Odds ratio: exp(beta) = odds(Y | x+1) / odds(Y | x), i.e. the exponentiated
# coefficient tells how the odds of high_use multiply when x increases by one unit
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01315384 0.00406496 0.03896248
## freetime 1.07523395 0.82191021 1.40863837
## absences 1.08858040 1.04303580 1.14007640
## sexM 2.55502845 1.55086699 4.26706152
## goout 2.03591174 1.60487130 2.61688539
When goout (spending time with friends) is included in the model, freetime no longer has explanatory power. From the output above, one can see that the number of absences, gender, and social activity seem to explain alcohol usage quite well. For example, an individual reporting one point more (on the Likert scale) of time spent with friends is about twice as likely to use high amounts of alcohol, everything else held constant. Likewise, males are much more likely to consume a lot of alcohol, and absences carry a small risk factor. Note that freetime is not significant at the 10% (or 5%) level and its odds-ratio confidence interval includes 1, which means the model gives no clear indication of whether students with more freetime have a higher or lower risk of consuming large amounts of alcohol. The results support the initial hypotheses on social behaviour, gender, and absences, but not those on mother's education (based on the graph above) or freetime.
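The odds-ratio reading used above follows directly from the model: exponentiating a logistic-regression coefficient gives the multiplicative change in odds for a one-unit increase in the predictor. A small numeric check on simulated data (the variable names and true slope are invented):

```r
# Fit a logistic model to simulated data and verify that exp(coef) equals
# the ratio of model-implied odds at x+1 versus x.
set.seed(2)
x <- rnorm(500)
p <- plogis(-1 + 0.7 * x)          # true log-odds slope 0.7
y <- rbinom(500, 1, p)
fit <- glm(y ~ x, family = "binomial")
b <- coef(fit)["x"]
odds_at <- function(x0) exp(coef(fit)[1] + b * x0)  # model-implied odds at x = x0
odds_at(1) / odds_at(0)            # identical to exp(b), the odds ratio
exp(b)
```

This is why a goout odds ratio of about 2 above is read as "twice the odds of high use per extra Likert point, everything else held constant".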
Now we try to make predictions and cross-validation based on the model used.
# find the model with glm()
m <- glm(high_use ~ freetime + absences + sex + goout, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 250 18
## TRUE 64 50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65445026 0.04712042 0.70157068
## TRUE 0.16753927 0.13089005 0.29842932
## Sum 0.82198953 0.17801047 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2146597
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2225131
Based on the definition of high users, our model classifies 64 individuals as non-high users when in fact they were high users, 18 as high users when they were not, and 50 were correctly identified as high users. This is also shown in the graph of predictions against actual values. The proportion table shows the probabilities that the model is correct or incorrect (type I and type II error rates can easily be calculated from it). Noteworthy is that the loss (0.21) is lower than the baseline (0.26) of a model using failures, absences, and sex as explanatory variables, which suggests our model works better. Note that 0.21 is the training error; the 10-fold cross-validation error is slightly higher (about 0.22).
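The loss function used above is simply the share of misclassified observations: a prediction counts as wrong when the predicted probability falls on the wrong side of 0.5. A minimal sketch with hand-made values (not the alc data) to show how it behaves:

```r
# Mean prediction error: fraction of cases where |class - prob| > 0.5,
# i.e. the predicted probability is on the wrong side of 0.5.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Four toy cases: the 2nd and 3rd predictions fall on the wrong side of 0.5.
loss_func(class = c(0, 0, 1, 1), prob = c(0.1, 0.8, 0.3, 0.9))  # 0.5
```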
Here we study the Boston data set. There are 506 observations and 14 variables. Basic information about the variables: crim is the per capita crime rate by town, dis is the weighted mean of distances to five Boston employment centres, ptratio is the pupil-teacher ratio by town, and medv is the median value of owner-occupied homes in $1000s. As we can see from the summary, crime is highly skewed with outliers, whereas ptratio and medv are distributed fairly evenly.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
plot(density(Boston$crim))
plot(density(Boston$ptratio))
plot(density(Boston$medv))
#calculate the correlations and plot them
cor_matrix <- cor(Boston)
cor_matrix %>% round(digits=2) %>% corrplot.mixed(tl.cex = 0.7,number.cex=0.7,order="hclust",addrect=2)
From the correlation plot we can see the relevant correlations between the variables. As one might expect, industrial areas are associated with higher nitrogen oxide concentrations and are negatively correlated with residential zoning and with distance to employment centres. Crime correlates, among other things, with accessibility to radial highways and with industrial areas.
Next we standardize the data set; the distributions are easier to compare with standardized variables. We then cut the crime variable into a factor with four levels, from "low" to "high", using its quantiles as break points, and split the data into a training set (80% of observations) and a test set (the remaining 20%).
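The standardize-then-discretize recipe can be sketched on a generic skewed vector before applying it to the real data below (the exponential draw here is just a stand-in for a crime-rate-like variable):

```r
# Standardize a skewed variable and cut it into quartile classes.
set.seed(3)
x <- rexp(100)                         # skewed, like the crime rate
x_scaled <- as.vector(scale(x))        # (x - mean(x)) / sd(x)
bins <- quantile(x_scaled)             # min, quartiles, max as break points
categories <- cut(x_scaled, breaks = bins, include.lowest = TRUE,
                  labels = c("low", "med_low", "med_high", "high"))
table(categories)                      # roughly 25 observations per class
```

Quantile breaks guarantee roughly equal class sizes regardless of how skewed the original variable is, which is why they suit the crime rate here.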
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
crime <- cut(scaled_crim, breaks= bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
crime
## [1] low low low low low low med_low
## [8] med_low med_low med_low med_low med_low med_low med_high
## [15] med_high med_high med_high med_high med_high med_high med_high
## [22] med_high med_high med_high med_high med_high med_high med_high
## [29] med_high med_high med_high med_high med_high med_high med_high
## [36] low med_low low med_low low low med_low
## [43] med_low med_low med_low med_low med_low med_low med_low
## [50] med_low med_low low low low low low
## [57] low low med_low med_low med_low med_low med_low
## [64] med_low low low low low med_low med_low
## [71] med_low med_low med_low med_low low med_low med_low
## [78] med_low low med_low low low low low
## [85] low low low low low low low
## [92] low low low low med_low med_low med_low
## [99] low low med_low med_low med_low med_low med_low
## [106] med_low med_low med_low med_low med_high med_low med_low
## [113] med_low med_low med_low med_low med_low med_low med_low
## [120] med_low low low med_low med_low med_low med_low
## [127] med_high med_high med_high med_high med_high med_high med_high
## [134] med_high med_high med_high med_high med_high med_low med_high
## [141] med_high med_high med_high high med_high med_high med_high
## [148] med_high med_high med_high med_high med_high med_high med_high
## [155] med_high med_high med_high med_high med_high med_high med_high
## [162] med_high med_high med_high med_high med_high med_high med_high
## [169] med_high med_high med_high med_high med_low med_low med_low
## [176] low low low low low low low
## [183] med_low med_low med_low low low low med_low
## [190] med_low med_low low med_low low low low
## [197] low low low low low low low
## [204] low low med_low med_low med_low med_low med_high
## [211] med_low med_high med_low med_low med_high med_low low
## [218] low med_low med_low med_high med_high med_high med_high
## [225] med_high med_high med_high med_high med_high med_high med_high
## [232] med_high med_high med_high med_high med_high med_high med_high
## [239] med_low med_low med_low med_low med_low med_low med_low
## [246] med_low med_high med_low med_low med_low med_low med_low
## [253] med_low med_high low low low med_high med_high
## [260] med_high med_high med_high med_high med_high med_high med_high
## [267] med_high med_high med_high med_low med_high med_low med_low
## [274] med_low low med_low med_low low low med_low
## [281] low low low low low low low
## [288] low low low low low low med_low
## [295] low med_low low med_low low low low
## [302] low med_low med_low low low low low
## [309] med_high med_high med_high med_high med_high med_high med_high
## [316] med_low med_high med_low med_high med_high med_low med_low
## [323] med_high med_high med_high med_low med_high med_low low
## [330] low low low low low low low
## [337] low low low low low low low
## [344] low low low low low low low
## [351] low low low low low med_low high
## [358] high high high high high high high
## [365] med_high high high high high high high
## [372] high high high high high high high
## [379] high high high high high high high
## [386] high high high high high high high
## [393] high high high high high high high
## [400] high high high high high high high
## [407] high high high high high high high
## [414] high high high high high high high
## [421] high high high high high high high
## [428] high high high high high high high
## [435] high high high high high high high
## [442] high high high high high high high
## [449] high high high high high high high
## [456] high high high high high high high
## [463] high high high med_high high high high
## [470] high high high med_high high high high
## [477] high high high high high high high
## [484] med_high med_high med_high high high med_low med_low
## [491] med_low med_low med_low med_low med_high med_low med_high
## [498] med_high med_low med_low med_low low low low
## [505] med_low low
## Levels: low med_low med_high high
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
n
## [1] 506
index <- sample(n, size = n * 0.8)
index
## [1] 394 462 396 5 126 354 295 297 188 32 121 148 107 163 305 329 250
## [18] 172 115 204 264 76 173 339 100 467 501 33 446 437 438 127 135 331
## [35] 84 241 214 490 371 152 407 356 97 132 198 477 213 83 475 327 185
## [52] 219 302 175 232 366 488 460 71 150 251 247 311 98 169 393 455 120
## [69] 78 310 19 246 443 89 41 108 506 259 372 226 258 256 197 419 430
## [86] 294 65 151 239 364 382 112 325 403 290 315 257 62 209 427 31 376
## [103] 17 405 300 266 384 203 134 352 483 255 42 478 90 408 465 413 238
## [120] 391 367 208 7 374 57 113 166 289 67 170 377 406 131 236 8 347
## [137] 2 79 200 268 320 395 92 283 28 275 357 11 72 359 426 341 457
## [154] 299 243 125 217 249 322 350 480 155 421 230 502 145 222 260 309 276
## [171] 1 410 282 234 215 368 487 474 133 358 439 415 69 167 355 379 240
## [188] 423 450 176 224 442 493 77 269 195 389 409 495 267 494 298 491 183
## [205] 212 324 351 39 397 91 448 292 400 262 160 273 482 59 194 10 87
## [222] 317 86 182 449 440 16 291 481 58 498 207 321 144 399 95 119 253
## [239] 420 452 179 186 123 147 286 171 353 466 301 124 369 103 431 463 114
## [256] 174 122 412 205 308 229 505 180 416 277 307 306 383 73 454 61 432
## [273] 444 56 279 74 190 244 130 362 44 184 335 111 343 154 48 82 116
## [290] 192 221 24 346 313 181 37 138 288 118 47 492 161 485 459 334 445
## [307] 458 216 336 3 378 202 263 461 285 245 388 429 101 333 293 365 9
## [324] 476 469 64 23 270 303 342 433 363 85 196 157 6 80 35 104 486
## [341] 146 66 29 316 242 418 473 40 252 361 159 117 165 93 265 453 199
## [358] 380 36 281 434 51 503 168 189 428 375 470 26 304 50 248 178 153
## [375] 489 472 447 227 237 233 193 451 404 330 280 21 201 4 162 38 385
## [392] 106 370 479 143 312 25 129 52 500 386 323 20 164
trainingset <- boston_scaled[index,]
testset <- boston_scaled[-index,]
correct_classes <- testset$crime
testset <- dplyr::select(testset, -crime)
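The split can be sanity-checked: with 506 rows, an 80% sample should leave 404 rows for training and 102 for testing. A minimal sketch; the seed is an assumption added for reproducibility (the run above used none, so the exact indices differ):

```r
# Sanity check of the 80/20 split. set.seed is an assumption;
# the original run above did not fix a seed.
n <- 506
set.seed(1)
index <- sample(n, size = floor(n * 0.8))
length(index)      # 404 rows go to the training set
n - length(index)  # 102 rows remain for the test set
```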
Next we fit an LDA model, trying to explain the crime rate with the other variables in the data. The first linear discriminant does a good job of dividing the data: there are two main groups, and the high-crime areas are vastly different from the rest. High crime is closely associated with accessibility to radial highways (variable rad), which seems to be the dividing factor. From the test set and the prediction table we can see that predictions of high crime are accurate, but there is some dispersion among the med_high to low classes. Overall, the model works quite adequately.
lda.fit <- lda(crime ~ ., data = trainingset)
lda.fit
## Call:
## lda(crime ~ ., data = trainingset)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2475248 0.2376238 0.2599010
##
## Group means:
## zn indus chas nox rm
## low 1.00971304 -0.9420787 -0.15765625 -0.8964669 0.4754299
## med_low -0.06918811 -0.2661348 -0.07547406 -0.5737607 -0.1368375
## med_high -0.37915407 0.2015343 0.17879700 0.4630668 0.1476340
## high -0.48724019 1.0170492 -0.04735191 1.0582604 -0.4171019
## age dis rad tax ptratio
## low -0.8537188 0.8956460 -0.6919669 -0.7403760 -0.5099792
## med_low -0.2954614 0.3412147 -0.5397114 -0.4433276 -0.0796942
## med_high 0.4145037 -0.4163965 -0.4004598 -0.3061125 -0.4586876
## high 0.8092809 -0.8488686 1.6388211 1.5145512 0.7815834
## black lstat medv
## low 0.37546905 -0.78695096 0.5456772
## med_low 0.30499162 -0.12347789 -0.0140051
## med_high 0.05080184 -0.04645095 0.2202353
## high -0.73977387 0.89707666 -0.6732395
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.07429930 0.64759747 -0.86406772
## indus 0.04473036 -0.26308873 0.33263942
## chas -0.09549653 -0.04353417 0.05661754
## nox 0.38740700 -0.86329140 -1.31144646
## rm -0.10782652 -0.09198197 -0.17276358
## age 0.23322005 -0.17118015 -0.04374182
## dis -0.05804157 -0.18852476 0.11765028
## rad 3.15505806 0.83656790 -0.27161047
## tax 0.03229992 0.10376562 0.58040175
## ptratio 0.10469874 0.08146420 -0.15510958
## black -0.10264241 0.03780032 0.13670777
## lstat 0.19947897 -0.15757292 0.42849545
## medv 0.18994864 -0.37011661 -0.08298239
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9481 0.0382 0.0138
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(trainingset$crime)
plot(lda.fit, dimen =2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
lda.pred <- predict(lda.fit, newdata = testset)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 7 3 0
## med_low 3 21 2 0
## med_high 1 15 13 1
## high 0 0 0 22
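The overall accuracy can be computed from the cross-tabulation above. This sketch re-enters the printed counts by hand rather than re-running the prediction:

```r
# Confusion matrix values copied from the table printed above.
conf <- matrix(c(14,  7,  3,  0,
                  3, 21,  2,  0,
                  1, 15, 13,  1,
                  0,  0,  0, 22),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # 70/102, roughly 0.69
```

So about two thirds of the test observations land on the diagonal, with the high class classified perfectly.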
Next we calculate the distance matrix in order to cluster the Boston data. As can be seen from the pairs plot and the ggpairs plot (the latter is a bit messy), two centers seem to work quite well on this data. The most appropriate number of centers is the one where the total within-cluster sum of squares (WCSS) stops declining rapidly.
library(MASS)
data('Boston')
dist_euclidian <- dist(Boston)
summary(dist_euclidian)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.620 170.500 226.300 371.900 626.000
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_euclidian, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
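The elbow can also be located programmatically as the k right after the sharpest relative drop in total WCSS. A sketch, here run on the scaled data with a fixed seed rather than on the distance matrix, so the numbers will differ somewhat from the plot above:

```r
library(MASS)
data("Boston")
set.seed(123)  # assumption: fixed seed for reproducibility
boston_std <- scale(Boston)
k_max <- 10
twcss <- sapply(1:k_max, function(k) {
  kmeans(boston_std, centers = k, nstart = 10)$tot.withinss
})
# relative drop in WCSS when moving from k to k + 1 clusters
drops <- -diff(twcss) / twcss[-k_max]
which.max(drops) + 1  # the k right after the sharpest relative drop
```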
km <- kmeans(dist_euclidian, centers=2)
pairs(Boston, col = km$cluster, pch = km$cluster) # note: pairs() has no alpha argument, so wrapping it into col had no effect
library(GGally)
data.frame(km$cluster)[,1]
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [246] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [281] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [316] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [351] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [386] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [421] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [456] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [491] 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
Boston2 <- mutate(Boston, cluster = as.factor(data.frame(km$cluster)[,1]))
ggpairs(Boston2, mapping = aes(col = cluster, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), title ="K-means (2) colored plots")
## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero
Let’s first load the data we produced with the create_human.R script. The data is a collection of gross national income, expected education, life expectancy, female-to-male education and labor shares, maternal mortality rate, adolescent birth rate, and female parliamentary representation. In other words, the data describes human development across countries.
From the first plot below we can see that GNI, the maternal mortality rate, and the adolescent birth rate are right-skewed with long right tails. The education ratio, labor ratio, and life expectancy are left-skewed with high kurtosis. Expected education is distributed fairly close to a normal distribution.
From the second plot we can see that the maternal mortality rate and the adolescent birth rate are positively correlated with each other and negatively correlated with high-development variables such as the education ratio, GNI, and life expectancy.
human <- read.csv(file = "data/human.csv",row.names = "X")
head(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp GNI Mat.Mor Ado.Birth
## Norway 1.0072389 0.8908297 81.6 17.5 64992 4 7.8
## Australia 0.9968288 0.8189415 82.4 20.2 42261 6 12.1
## Switzerland 0.9834369 0.8251001 83.0 15.8 56431 6 1.9
## Denmark 0.9886128 0.8840361 80.2 18.7 44025 5 5.1
## Netherlands 0.9690608 0.8286119 81.6 17.9 45435 6 6.2
## Germany 0.9927835 0.8072289 80.9 16.5 43919 7 3.8
## Parli.F
## Norway 39.6
## Australia 30.5
## Switzerland 28.5
## Denmark 38.0
## Netherlands 36.9
## Germany 36.9
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(GGally)
library(corrplot)
library(tidyr)
library(dplyr)
# visualize the 'human' variables
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% round(digits=2) %>% corrplot.mixed(tl.cex = 0.7,number.cex=0.7,order="hclust",addrect=2)
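A small helper can list the strongest pairwise correlations numerically, as a complement to the corrplot. Since human.csv is read from a local folder, the sketch below demonstrates it on the built-in mtcars data; applying it to human is a one-line swap.

```r
# List the n variable pairs with the largest absolute correlation.
top_correlations <- function(df, n = 5) {
  cm <- cor(df)
  cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each variable pair only once
  long <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
  long[order(-abs(long$Freq)), ][seq_len(n), ]
}
top_correlations(mtcars, n = 3)
```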
Let’s perform a PCA on the data. As we can see, without standardization the biplot is quite a mess.
# perform principal component analysis (with the SVD method)
pca_human_nstd <- prcomp(human)
# rounded percentages of variance captured by each PC
pca_pr <- round(summary(pca_human_nstd)$importance[2, ], digits = 5)*100
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_nstd, choices = 1:2, cex = c(0.4, 0.4), col = c("grey60","navyblue"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
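The collapse onto PC1 is a pure scale effect: prcomp maximizes variance, and GNI (ranging into the hundreds of thousands) dwarfs every other variable. A toy illustration with simulated data (the names and standard deviations are made up for the example):

```r
set.seed(42)
toy <- data.frame(small = rnorm(200, sd = 1),     # like the ratio variables
                  big   = rnorm(200, sd = 1000))  # stands in for GNI
p_raw <- summary(prcomp(toy))$importance[2, ]
p_std <- summary(prcomp(toy, scale. = TRUE))$importance[2, ]
round(p_raw, 4)  # PC1 captures essentially all the variance
round(p_std, 4)  # after scaling, the split is roughly even
```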
Let’s standardize the data now and interpret the results. Without standardization GNI seems to be orthogonal to all the other variables, simply because its variance is by far the largest. With standardization we can see more clearly which variables are correlated and which contribute to principal components 1 and 2. The correlations we interpreted before still hold. Since female parliamentary participation (Parli.F) and the labor ratio (Labo.FM) point upwards in the biplot, they are roughly orthogonal to, that is uncorrelated with, the other variables in the data.
This result seems plausible: if the child mortality rate in a country is high or mothers die young, this affects life expectancy directly, and indirectly the expected education rate and therefore gross national income. One could go as far as to say that the contribution of mothers, or more generally of females, is an important factor in development (PC1). The other part contributing to welfare is the highly correlated group of variables such as GNI, life expectancy, and education, and furthermore the education of females.
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# rounded percentages of variance captured by each PC
pca_pr <- round(summary(pca_human)$importance[2, ], digits = 5)*100
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.605 16.237 9.571 7.583 5.477 3.595 2.634 1.298
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.4, 0.4), col = c("grey60","navyblue"), xlab = pc_lab[1], ylab = pc_lab[2])
Next we perform an MCA on the tea dataset. There are 300 observations and 36 variables. Let’s cut the variables down to a more manageable number. We include the type of tea (black, green, etc.), additions (lemon, milk, etc.), the type of tea package (bag, loose, etc.), price (cheap, upscale, etc.), with or without sugar, where the tea is bought (chain store, tea shop, etc.), and whether or not it is drunk with breakfast.
library(FactoMineR)
library(ggplot2)
library(tidyr)
data(tea)
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#summary(tea)
# column names to keep in the dataset
# select the columns to keep to create a new dataset
library(dplyr)
tea_time <- select(tea,one_of(c("Tea","How","how","price","sugar","where","breakfast")))
# look at the summaries and structure of the data
#summary(tea_time)
str(tea_time)
## 'data.frame': 300 obs. of 7 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
#tea_time
#how, price, where
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
#levels(tea_time$how)
tea_time$how <- mapvalues(tea_time$how, from = c("tea bag", "tea bag+unpackaged", "unpackaged"), to = c("bag", "both", "loose"))
#levels(tea_time$how)
#levels(tea_time$price)
tea_time$price <- mapvalues(tea_time$price, from = c("p_branded","p_cheap","p_private label","p_unknown","p_upscale","p_variable"),to = c("branded", "cheap", "private","unknown","upscale","variable"))
#levels(tea_time$price)
#levels(tea_time$where)
tea_time$where <- mapvalues(tea_time$where, from = c("chain store","chain store+tea shop","tea shop" ),to = c("store", "both", "shop"))
#levels(tea_time$where)
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 25, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
# tea_time is available
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.304 0.249 0.188 0.183 0.153 0.148
## % of var. 13.299 10.893 8.218 8.018 6.691 6.483
## Cumulative % of var. 13.299 24.192 32.410 40.428 47.119 53.603
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.147 0.142 0.132 0.125 0.115 0.103
## % of var. 6.439 6.197 5.781 5.477 5.026 4.521
## Cumulative % of var. 60.042 66.239 72.020 77.497 82.524 87.045
## Dim.13 Dim.14 Dim.15 Dim.16
## Variance 0.095 0.080 0.073 0.048
## % of var. 4.174 3.479 3.210 2.091
## Cumulative % of var. 91.219 94.698 97.909 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.471 0.244 0.050 | 0.273 0.100 0.017 | 0.797 1.127
## 2 | -0.210 0.049 0.026 | -0.233 0.073 0.032 | 0.930 1.534
## 3 | -0.281 0.087 0.093 | 0.043 0.002 0.002 | -0.235 0.098
## 4 | -0.396 0.172 0.180 | 0.099 0.013 0.011 | -0.459 0.375
## 5 | -0.305 0.102 0.106 | -0.137 0.025 0.021 | 0.027 0.001
## 6 | -0.469 0.241 0.088 | 0.413 0.228 0.068 | -0.056 0.006
## 7 | -0.305 0.102 0.106 | -0.137 0.025 0.021 | 0.027 0.001
## 8 | -0.187 0.038 0.021 | -0.053 0.004 0.002 | 0.667 0.790
## 9 | 0.545 0.326 0.130 | -0.729 0.711 0.232 | 0.308 0.169
## 10 | 0.792 0.688 0.287 | -0.605 0.490 0.167 | 0.474 0.399
## cos2
## 1 0.143 |
## 2 0.511 |
## 3 0.065 |
## 4 0.243 |
## 5 0.001 |
## 6 0.001 |
## 7 0.001 |
## 8 0.267 |
## 9 0.042 |
## 10 0.103 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.436 2.200 0.062 4.310 | -0.084 0.099 0.002
## Earl Grey | -0.223 1.510 0.090 -5.189 | -0.132 0.643 0.031
## green | 0.330 0.563 0.013 2.007 | 0.960 5.815 0.114
## alone | -0.010 0.003 0.000 -0.240 | 0.182 1.238 0.062
## lemon | 0.530 1.452 0.035 3.222 | -0.197 0.245 0.005
## milk | -0.305 0.920 0.025 -2.721 | -0.202 0.490 0.011
## other | 0.415 0.242 0.005 1.261 | -1.814 5.665 0.102
## how_bag | -0.581 8.987 0.441 -11.487 | 0.302 2.964 0.119
## how_both | 0.363 1.940 0.060 4.239 | -0.946 16.078 0.408
## how_loose | 1.796 18.183 0.440 11.466 | 1.043 7.495 0.148
## v.test Dim.3 ctr cos2 v.test
## black -0.829 | 1.224 28.116 0.491 12.113 |
## Earl Grey -3.065 | -0.397 7.702 0.284 -9.214 |
## green 5.835 | -0.425 1.510 0.022 -2.582 |
## alone 4.293 | -0.225 2.496 0.094 -5.295 |
## lemon -1.197 | -0.791 5.229 0.077 -4.806 |
## milk -1.797 | 0.892 12.721 0.212 7.956 |
## other -5.517 | 1.520 5.274 0.071 4.623 |
## how_bag 5.971 | 0.149 0.950 0.029 2.937 |
## how_both -11.046 | -0.314 2.354 0.045 -3.671 |
## how_loose 6.662 | 0.119 0.130 0.002 0.763 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.091 0.114 0.491 |
## How | 0.056 0.133 0.338 |
## how | 0.619 0.463 0.045 |
## price | 0.611 0.304 0.146 |
## sugar | 0.049 0.010 0.116 |
## where | 0.699 0.622 0.021 |
## breakfast | 0.002 0.098 0.158 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
First we plot the frequencies of the variables. As we can see, most respondents drink Earl Grey from tea bags bought in a chain store, take it without additions, across all kinds of price ranges, every now and then adding sugar. The second plot illustrates the MCA model. We can see that the exclusive group (drinking upscale green tea, loose, from a tea shop) clusters together. The second group standing apart from the main cluster consists of casual drinkers who buy from both shops and stores, in a variable price range, as bags or loose tea, adding a secret ingredient (other, possibly alcohol; I’m thinking of a rum toddy here!). I excluded the other MCA graphs as they added little or no further information to the analysis.
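The category-level drivers behind the biplot can also be read off numerically from the MCA’s contribution table, which is less cluttered than the graph. A sketch re-running the same variable subset (with the original level labels, since tea ships with FactoMineR):

```r
library(FactoMineR)
data(tea)
tea_time <- tea[, c("Tea", "How", "how", "price", "sugar", "where", "breakfast")]
mca <- MCA(tea_time, graph = FALSE)
contrib <- mca$var$contrib
# the five categories contributing most to the first dimension
head(contrib[order(-contrib[, 1]), 1, drop = FALSE], 5)
```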